Residential Demand Response Applications
Using Batch Reinforcement Learning 

F. Ruelens, BJ. Claessens, S. Vandael, B. De Schutter, R. Babuska and R. Belmans. 


Abstract —Driven by recent advances in batch Reinforcement 
Learning (RL), this paper contributes to the application of batch 
RL to demand response. In contrast to conventional model- 
based approaches, batch RL techniques do not require a system 
identification step, which makes them more suitable for a large- 
scale implementation. This paper extends fitted Q-iteration, a 
standard batch RL technique, to the situation where a forecast of 
the exogenous data is provided. In general, batch RL techniques 
do not rely on expert knowledge on the system dynamics or the 
solution. However, if some expert knowledge is provided, it can 
be incorporated by using our novel policy adjustment method.
Finally, we tackle the challenge of finding an open-loop schedule 
required to participate in the day-ahead market. We propose a 
model-free Monte-Carlo estimator method that uses a metric to 
construct artificial trajectories and we illustrate this method by 
finding the day-ahead schedule of a heat-pump thermostat. Our 
experiments show that batch RL techniques provide a valuable 
alternative to model-based controllers and that they can be used 
to construct both closed-loop and open-loop policies. 

Index Terms —Batch reinforcement learning, demand response,
electric water heater, fitted Q-iteration, heat pump.


I. Introduction 

The increasing share of renewable energy sources introduces the need
for flexibility on the demand side of the electricity system [?].
Prominent examples of loads that offer flexibility at the residential
level are thermostatically controlled loads, such as heat pumps, air
conditioning units, and electric water heaters. These loads represent
about 20% of the total electricity consumption at the residential level
in the United States [?]. In addition, their market share is expected
to increase as a result of the electrification of heating and cooling
[?], making them an interesting domain for optimization methods [?],
[?]-[?]. Demand response programs offer demand flexibility by
motivating end users to adapt their consumption profile in response to
changes in the electricity price or other grid signals. The forecast
uncertainty of renewable energy sources [?], combined with their
limited controllability, has made demand response the topic of an
extensive number of research projects [?], [?], [?] and scientific
papers [?], [?], [?]-[?]. The traditional control paradigm defines the
demand response problem as a model-based control problem [?], [?], [?],
requiring a model of the demand response application, an optimizer, and
a forecasting technique. A critical step in setting up a model-based
controller comprises selecting accurate models and estimating the model
parameters. This step becomes more challenging considering the
heterogeneity of the end users and their different patterns of behavior
[?]. As a result, different end users are expected to have different
model parameters and even different models. As such, a large-scale
implementation of model-based controllers requires a stable and robust
approach that is able to identify the appropriate model and the
corresponding model parameters.

Reinforcement Learning (RL) [?], [14], on the other hand, is a
model-free technique that requires no system identification step and no
a priori knowledge. Recent developments in the field of reinforcement
learning show that RL techniques can replace or supplement model-based
techniques [?]. A number of recent papers provide examples of how a
popular RL method, Q-learning [?], can be used for demand response [?],
[?], [?], [?]. For example, in [?], O'Neill et al. propose an automated
energy management system based on Q-learning that learns how to make
optimal decisions for the consumers. In [?], Henze et al. investigate
the potential of Q-learning for the operation of commercial cold stores
and in [?], Kara et al. use Q-learning to control a cluster of
thermostatically controlled loads. In [?], Lee et al. propose a
bias-corrected form of Q-learning to operate battery charging in the
presence of volatile prices. While being a popular method, one of the
fundamental drawbacks of Q-learning is its inefficient use of
experiences, given that Q-learning discards the current data sample
after every update. As a result, more observations are needed to
propagate already known information through the state space. In order
to overcome this drawback, batch RL techniques [?]-[?] can be used. In
batch RL a controller estimates a control policy based on a batch of
experiences. These experiences can be a fixed set [20] or can be
gathered online by interacting with the environment [?]. Given that
batch RL algorithms can reuse past experiences, they converge faster
compared to techniques like Q-learning or SARSA. This makes batch RL
techniques suitable for practical implementations, such as demand
response. For example, the authors of [22] combine Q-learning with
eligibility traces in order to learn the consumer and time preferences
of demand response applications. In [?], the authors use a batch RL
technique to schedule a cluster of electric water heaters and in [23],
Vandael et al. use a batch RL technique to find a day-ahead consumption
plan of a cluster of electric vehicles. An excellent overview of batch
RL methods can be found in [21] and [24].

Inspired by the recent developments in batch RL, in particular fitted
Q-iteration by Ernst et al. [20], this paper builds upon the existing
batch RL literature and contributes to the application of batch RL
techniques to residential demand response. The contributions of our
paper can be summarized as follows: (1) we demonstrate how fitted
Q-iteration can be extended to the situation when a forecast of the
exogenous data is provided; (2) we propose a policy adjustment method
that exploits general expert knowledge about monotonicity conditions of
the control policy; (3) we introduce a model-free Monte Carlo estimator
method to find a day-ahead consumption plan by making use of a novel
metric based on Q-values.

Fig. 1. Building blocks of a batch Reinforcement Learning (RL) controller
in a demand response application.

This paper is structured as follows: Section II defines the building
blocks of our batch RL controller. Section III formulates the problem
as a Markov decision process. Section IV describes our model-free batch
RL techniques for demand response. Section V demonstrates the presented
techniques in a realistic demand response setting. To conclude,
Section VI summarizes the results and discusses further research.

II. Building blocks: model-free approach

Fig. 1 presents a general overview of the different building blocks of
our batch RL approach applied to a demand response setting. This paper
focuses on two types of thermostatically controlled loads. The first
type is a residential electric water heater with a stochastic hot-water
demand [25]. The dynamic behavior of the electric water heater used in
this paper is modeled by a nonlinear stratified thermal tank model as
described in [26]. Our second demand response application is a
heat-pump thermostat for a residential building. The temperature
dynamics of the building are modeled using a second-order equivalent
thermal parameter model [27], describing the temperature dynamics of
the indoor air and of the building envelope. In order to develop a
practical implementation we assume that the temperature of the building
envelope is a hidden state variable, and thus cannot be measured. In
addition, we assume that both applications are equipped with a backup
controller that guarantees the comfort and safety settings of the end
users. The backup controller can be a built-in overrule mechanism that
turns the application ON or OFF depending on the current state and a
predefined switching logic. The operation and settings of the backup
controller are assumed to be unknown; however, the batch RL controller
can measure the action of the backup controller (see dashed arrow in
Fig. 1).

Before the observed state information of the demand response
application can be sent to the batch RL algorithm, we apply a feature
extraction technique [?]. A first task of the feature extraction
technique is to extract non-observable state information that is
required to obtain a policy. For example, in our implementation of a
heat-pump thermostat we use feature extraction to represent the
temperature of the building envelope, which cannot be measured. A
second task of the feature extraction technique could be to find a
compact representation of the observable state. For example, in the
case of an electric water heater, feature extraction replaces the
individual tank temperature measurements by their average.

At the start of each day the batch RL controller constructs a control
policy for the next day, given a fixed batch of transitions and cost
values. The batch RL controller needs no a priori information on the
model dynamics and considers its environment a black box. As a result,
the batch RL controller can be applied to virtually every demand
response problem or even for cluster control [?]. During the day, an
exploration strategy is used online to interact with the environment
and to collect new transitions that are added systematically to the
batch.

The goal of this paper is to develop a model-free controller for two
relevant demand response business models [?], [?]: dynamic pricing and
day-ahead scheduling. The objective of the first business model is to
adapt the consumption profile in response to an external price signal
without violating the comfort settings of the end user. The optimal
solution is a closed-loop control policy that is a function of the
current and past measurements of the state. The second business model
relates to the participation in the day-ahead market. The objective is
to construct the day-ahead consumption plan and then follow it during
the day. The goal is to minimize the cost in the day-ahead market and
minimize any deviation between the day-ahead consumption plan and the
actual consumption. In contrast to the solution of the first business
model, the day-ahead consumption plan is a feed-forward plan for the
next day, i.e. an open-loop policy, which does not depend on
measurements of the state.


III. Markov decision process formulation 


In order to use reinforcement learning techniques, we formulate the
sequential decision problem of a demand response application as a
Markov decision process [?], [?]. The Markov decision process used in
this paper is defined by its d-dimensional state space
$X \subset \mathbb{R}^d$, its action space $U \subset \mathbb{R}$, its
stochastic discrete-time transition function $f$, and its cost function
$\rho$ [?]. The optimization horizon is considered finite, comprising
$T \in \mathbb{N} \setminus \{0\}$ steps, where at each discrete time
step $k$, the state evolves as follows:

$$x_{k+1} = f(x_k, u_k, w_k) \quad \forall k \in \{1, \dots, T-1\}, \tag{1}$$

with $w_k$ a realization of a random process drawn from a conditional
probability distribution $p_W(\cdot \mid x_k)$, $u_k \in U$ the control
action, and $x_k \in X$ the state. Associated with each state
transition, a cost $c_k$ is given by:

$$c_k = \rho(x_k, u_k, w_k) \quad \forall k \in \{1, \dots, T\}. \tag{2}$$

The goal is to find a control policy $h^* : X \to U$ that minimizes the
expected T-stage return for any state in the state space. The expected
T-stage return starting from $x_1$ and following $h^*$ is defined as
follows:

$$J_T^{h^*}(x_1) = \mathbb{E}_{w_k \sim p_W(\cdot \mid x_k)} \left[ \sum_{k=1}^{T} \rho(x_k, h^*(x_k), w_k) \right]. \tag{3}$$
A convenient way to characterize the policy $h^*$ is by using a
state-action value function or Q-function:

$$Q^{h^*}(x, u) = \mathbb{E}_{w \sim p_W(\cdot \mid x)} \left[ \rho(x, u, w) + J_T^{h^*}\big(f(x, h^*(x), w)\big) \right]. \tag{4}$$

The Q-function is the cumulative return starting from state $x$, taking
action $u$, and following $h^*$ thereafter. Starting from a Q-function
for every state-action pair, the policy is calculated as follows:

$$h^*(x) \in \arg\min_{u \in U} Q^{h^*}(x, u), \tag{5}$$

where $h^*$ satisfies the Bellman equation [?]. The next paragraphs
give a formal description of the state space, the backup controller,
and the cost function tailored to demand response.
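As a minimal sketch of the policy extraction in (5), the snippet below picks the cost-minimizing action from a tabular Q-function. The table `q_table`, the state labels, and the helper `greedy_action` are hypothetical and only serve as illustration; the paper itself works with a regression-based Q-function rather than a table.

```python
def greedy_action(q, state, actions):
    """Return the action with the lowest Q-value in `state` (costs are minimized)."""
    return min(actions, key=lambda u: q[(state, u)])


# Hypothetical toy Q-table: two states and two power levels in kW.
q_table = {
    ("cold", 0.0): 5.0, ("cold", 2.3): 3.0,
    ("warm", 0.0): 1.0, ("warm", 2.3): 4.0,
}
actions = [0.0, 2.3]

# Evaluate the arg min of (5) for every state in the toy state space.
policy = {s: greedy_action(q_table, s, actions) for s in ("cold", "warm")}
```

Because costs rather than rewards are used, the policy takes the arg min instead of the more common arg max.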


A. State description 

The state space $X$ is spanned by a time-dependent state space
component $X_t$, a controllable state space component $X_{ph}$, and an
uncontrollable exogenous state space component $X_{ex}$ [?]:

$$X = X_t \times X_{ph} \times X_{ex}. \tag{6}$$

1) Timing: The state space component $X_t$ describes the part of the
state space related to timing, i.e. it carries timing information that
is relevant for the dynamics of the system:

$$X_t = X_t^q \times X_t^d \quad \text{with} \quad X_t^q = \{1, \dots, 96\}, \; X_t^d = \{1, \dots, 7\}, \tag{7}$$

where $x_t^q \in X_t^q$ denotes the quarter in the day, and
$x_t^d \in X_t^d$ denotes the day in the week. The rationale is that
most consumer behavior tends to be repetitive and follows a diurnal
pattern.

2) Physical representation: The controllable state space component
$X_{ph}$ represents the physical state information related to the
quantities that are measured locally and that are influenced by the
control actions, e.g. the indoor air temperature or the state of charge
of an electric water heater:

$$x_{ph} \in X_{ph} \quad \text{with} \quad \underline{x}_{ph} < x_{ph} < \overline{x}_{ph}, \tag{8}$$

where $\underline{x}_{ph}$ and $\overline{x}_{ph}$ denote the lower and
upper bound, set to guarantee the comfort and safety of the end user.

3) Exogenous Information: The state description of the uncontrollable
exogenous state is split into two components:

$$X_{ex} = X_{ex}^{ph} \times X_{ex}^{c}. \tag{9}$$

When the random disturbance $w_{k+1}$ is independent of $w_k$ given
$x_k$, there is no need to include uncontrollable exogenous state
information in the state space. However, most physical processes, such
as the outside temperature and solar radiation, exhibit a certain
degree of autocorrelation, where the next state depends on the previous
states. For this reason we include an exogenous state space component
$x_{ex}^{ph} \in X_{ex}^{ph}$ in our state space description [?]. This
exogenous state space component is related to the observable exogenous
information that has an impact on the physical dynamics and cannot be
influenced by the control actions. The second exogenous state space
component $x_{ex}^{c} \in X_{ex}^{c}$ has no direct influence on the
dynamics, but contains information to calculate the cost $c_k$. This
work assumes that a deterministic forecast of the exogenous state
information related to the cost, $\lambda$, is provided for the time
span covering the optimization problem.
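The timing component in (7) can be derived directly from a timestamp. The sketch below is one possible mapping; the function name `timing_state` and the convention that day 1 is Monday are assumptions, since the text only fixes the ranges 1..96 and 1..7.

```python
from datetime import datetime


def timing_state(ts: datetime) -> tuple:
    """Map a timestamp to (quarter in the day, day in the week)."""
    quarter = ts.hour * 4 + ts.minute // 15 + 1  # 1..96
    day = ts.isoweekday()                        # assumed: 1 = Monday, ..., 7 = Sunday
    return quarter, day
```

For instance, midnight on a Monday maps to quarter 1 of day 1, and 23:59 on a Sunday maps to quarter 96 of day 7.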


B. Backup controller 

In order to develop a practical demand response technology we assume
that each device is equipped with an overrule mechanism that guarantees
comfort and safety constraints. The backup function
$B : X \times U \to U^{ph}$ maps the requested control action
$u_k \in U$ taken in state $x_k$ to a physical control action
$u_k^{ph} \in U^{ph}$:

$$u_k^{ph} = B(x_k, u_k). \tag{10}$$

The settings of the backup function $B$ are unknown to the batch RL
controller, but the resulting action $u_k^{ph}$ can be measured (see
dashed arrow in Fig. 1).

C. Cost function 

In general, RL techniques do not require detailed knowledge of the cost
function. However, for most demand response business models a cost
function is available. This paper considers two typical cost functions
related to demand response. In the dynamic pricing scenario an external
price profile $\lambda$ is known deterministically at the start of the
horizon. The cost function is described as:

$$c_k = u_k^{ph} \lambda_k \Delta t, \tag{11}$$

where $\lambda_k$ is the electricity price at time step $k$ and
$\Delta t$ is the length of a control period. The objective of the
second business case is to determine a day-ahead consumption plan and
to follow this plan during operation. The cost of the day-ahead
consumption plan should be minimized based on day-ahead prices. In
addition, any deviation between the planned consumption and actual
consumption should be avoided. As such, the cost function can be
written as:

$$c_k = u_k \lambda_k \Delta t + \alpha \left| u_k \Delta t - u_k^{ph} \Delta t \right|, \tag{12}$$

where $u_k$ is the planned consumption, $u_k^{ph}$ is the actual
consumption and $\lambda_k$ is the forecasted day-ahead price. The
first part of (12) is the cost for buying energy at the day-ahead
market. The second part, weighted by $\alpha$, defines a penalty for
any deviation between the planned consumption and the actual
consumption.
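The two cost functions (11) and (12) translate directly into code. The sketch below assumes power in kW, prices per kWh, and the control period expressed in hours; these units and the function names are illustrative assumptions, not taken from the paper.

```python
def dynamic_pricing_cost(u_ph_kw, price, dt_h):
    """Cost (11): measured power times price times control period length."""
    return u_ph_kw * price * dt_h


def day_ahead_cost(u_plan_kw, u_ph_kw, price, dt_h, alpha):
    """Cost (12): day-ahead energy cost plus a penalty on the plan deviation."""
    energy_cost = u_plan_kw * price * dt_h
    deviation = abs(u_plan_kw * dt_h - u_ph_kw * dt_h)
    return energy_cost + alpha * deviation
```

With a quarter-hour period (`dt_h = 0.25`), a planned 2 kW, a measured 1 kW, a price of 0.5 per kWh and `alpha = 2`, the day-ahead cost is 0.25 for the purchased energy plus 0.5 for the 0.25 kWh deviation.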


D. Reinforcement learning for demand response 

When the description of the transition function and cost function is
available, techniques that make use of the Markov decision process
framework, such as approximate dynamic programming [?] or direct policy
search [24], can be used to find near-optimal policies. However, in our
implementation we assume that the transition function $f$, the backup
controller $B$, and the underlying probability of the exogenous
information are unknown. For this reason, we present a model-free batch
RL approach that builds on previous theoretical work on RL, in
particular fitted Q-iteration [20], expert knowledge [?], and the
synthesis of artificial trajectories [?].
Algorithm 1 Fitted Q-iteration using a forecast of the exogenous data
(extended FQI)

Input: $\mathcal{F} = \{(x_l, u_l, x'_l, u_l^{ph})\}_{l=1}^{\#\mathcal{F}}$, the forecast $\hat{x}_{ex}^{ph}$, $\lambda$

1: $N \leftarrow 0$
2: let $Q_0$ be zero everywhere on $X \times U$
3: repeat
4:   for $l = 1, \dots, \#\mathcal{F}$ do
5:     $c_l \leftarrow \rho(x_l, u_l^{ph}, \lambda)$
6:     $\hat{x}'_l \leftarrow (x_{t,l}^{q\prime}, x_{t,l}^{d\prime}, x'_{ph,l}, \hat{x}_{ex,l}^{ph\prime})$   ▷ replace the observed exogenous part of the next state by its forecasted value
7:     $Q_{N,l} \leftarrow c_l + \min_{u \in U} Q_{N-1}(\hat{x}'_l, u)$
8:   end for
9:   use regression to obtain $Q_N$ from $\mathcal{T}_{reg} = \{((x_l, u_l), Q_{N,l}), \; l = 1, \dots, \#\mathcal{F}\}$
10:  increment $N$
11: until stopping criterion is reached

Output: $Q^* = Q_N$


IV. Algorithms 

Typically, batch RL techniques construct policies based on a batch of
tuples of the form
$\mathcal{F} = \{(x_l, u_l, x'_l, c_l)\}_{l=1}^{\#\mathcal{F}}$, where
$x_l = (x_{t,l}^{q}, x_{t,l}^{d}, x_{ph,l}, x_{ex,l}^{ph})$ denotes the
state at time step $l$ and $x'_l$ denotes the state at time step
$l + 1$. However, for most demand response applications, the cost
function $\rho$ is given a priori and is of the form
$\rho(x_l, u_l^{ph}, \lambda)$. As such, this paper considers tuples of
the form $(x_l, u_l, x'_l, u_l^{ph})$.

A. Fitted Q-iteration using a forecast of the exogenous data 

Here we show how fitted Q-iteration [20] can be extended to the
situation when a forecast of the exogenous state space component is
provided (Algorithm 1). The algorithm iteratively builds a training set
$\mathcal{T}_{reg}$ with all state-action pairs $(x, u)$ in
$\mathcal{F}$ as the input. The target values consist of the
corresponding cost values $\rho(x, u^{ph}, \lambda)$ and the optimal
Q-values, based on the approximation of the Q-function of the previous
iteration, for the next states $\min_{u \in U} Q_{N-1}(\hat{x}'_l, u)$.
For a finite horizon problem the stopping criterion is reached when
$N = T$, where $T$ is the number of control periods in the optimization
horizon and $N$ denotes the iteration. It is important to note that
$\hat{x}'_l$ denotes the successor state in $\mathcal{F}$, where the
observed exogenous state space information $x_{ex,l}^{ph\prime}$ is
replaced by its forecasted value $\hat{x}_{ex,l}^{ph\prime}$ (line 6 in
Algorithm 1). Note that in our algorithm the next state contains
information on the forecasted exogenous data, whereas for standard
fitted Q-iteration [20] the next state contains past observations of
the exogenous data. By replacing the observed exogenous part of the
next state by its forecasted value, the Q-function of the next state
assumes that the exogenous information will follow its forecast. The
proposed algorithm is relevant for demand response applications that
are influenced by exogenous weather data. Examples of these
applications are heat-pump thermostats and air conditioning units.

In principle, any regression algorithm, such as neural networks [?],
can be applied in combination with fitted Q-iteration. However, because
of its robustness and fast calculation time, an extremely randomized
trees ensemble method [20] is used.
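The following is a minimal, self-contained sketch of the extended FQI loop, not the authors' implementation. To keep it dependency-free, a tabular averaging regressor (`fit_table`) stands in for the extremely randomized trees, states are reduced to pairs `(x_phys, x_exo)`, and the cost is passed in as a callable; all of these names and simplifications are assumptions made for illustration.

```python
from collections import defaultdict


def fit_table(samples):
    """Stand-in regressor: average the targets per (state, action) key."""
    sums, counts = defaultdict(float), defaultdict(int)
    for key, target in samples:
        sums[key] += target
        counts[key] += 1
    table = {k: sums[k] / counts[k] for k in sums}
    return lambda x, u: table.get((x, u), 0.0)  # unseen pairs default to 0


def extended_fqi(batch, forecasts, actions, cost, horizon):
    """batch: list of (x, u, x_next, u_ph) with x = (x_phys, x_exo).
    forecasts[idx]: forecasted exogenous value for the next state of tuple idx."""
    q = lambda x, u: 0.0                               # Q_0 is zero everywhere
    for _ in range(horizon):                           # finite horizon: T iterations
        samples = []
        for idx, (x, u, x_next, u_ph) in enumerate(batch):
            c = cost(x, u_ph)                          # cost from the measured action
            x_hat = (x_next[0], forecasts[idx])        # observed exo -> forecast
            target = c + min(q(x_hat, a) for a in actions)
            samples.append(((x, u), target))
        q = fit_table(samples)                         # regression step
    return q
```

The only difference from standard fitted Q-iteration in this sketch is the construction of `x_hat`: the exogenous part of the observed successor state is discarded in favor of the forecast before the minimization over actions.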

B. Expert policy adjustment 

Given the Q-function from Algorithm 1, a near-optimal policy can be
constructed by solving (5) for every state in the state space. In this
section, we show how expert knowledge on the monotonicity of the policy
can be exploited to regularize the policy. The method enforces
monotonicity conditions by using a convex optimization to approximate
the policy, where expert knowledge is included in the form of extra
constraints. These constraints can result directly from the expert or
from a model-based solution. In order to define a convex optimization
problem we use a fuzzy model with triangular membership functions [24]
to approximate the policy. The centers of the triangular membership
functions are located on an equidistant grid with $N_g$ membership
functions along each dimension of the state space. This partitioning
leads to $N_g^d$ state-dependent membership functions for each action.
The parameter vector $\theta^*$ that approximates the original policy
can be found by solving the following least-squares problem:

$$\theta^* \in \arg\min_{\theta} \sum_{l=1}^{\#\mathcal{F}} \left( [F(\theta)](x_l) - h^*(x_l) \right)^2, \quad \text{s.t. expert knowledge} \tag{13}$$

where $F$ denotes an approximation mapping of a weighted linear
combination of triangular membership functions and $[F(\theta)](x)$
denotes the policy $F(\theta)$ evaluated at state $x$. Let $h^*$ be the
policy obtained by solving (5) given the Q-function obtained by
Algorithm 1. A more detailed description of how these triangular
membership functions are defined can be found in [24]. The fuzzy
approximation of the policy allows us to add expert knowledge to the
policy in the form of convex constraints of the least-squares problem
defined in (13), which can be solved using a convex optimization
solver. Using the same notation as in [?], we can enforce monotonicity
conditions along the $d$th dimension of the state space as follows:

$$\delta_d [F(\theta)](x_d) \leq \delta_d [F(\theta)](x'_d) \tag{14}$$

for all state components $x_d \leq x'_d$ along the dimension $d$. If
$\delta_d$ is $-1$ then $[F(\theta)]$ will be decreasing along the
$d$th dimension of $X$, whereas if $\delta_d$ is $1$ then
$[F(\theta)]$ will be increasing along the $d$th dimension of $X$. Once
$\theta^*$ is found, the adjusted policy $\hat{h}$, given this expert
knowledge, can be calculated as $\hat{h}(x) = [F(\theta^*)](x)$. When
the batch $\mathcal{F}$ contains a limited number of tuples, e.g. only
a few days, the expert policy adjustment method can be used to improve
the quality of the policy of a demand response problem.
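To illustrate the idea of a monotonicity-constrained least-squares fit, the sketch below handles a deliberately simplified case: a policy sampled on a one-dimensional grid, with one free parameter per grid point instead of the fuzzy approximator. Under that assumption the constrained problem reduces to isotonic regression, which the pool-adjacent-violators algorithm solves without a general convex solver. The function name `monotone_fit` is hypothetical, and this is a stand-in for, not a reproduction of, the method in the text.

```python
def monotone_fit(values, delta=1):
    """Least-squares projection of `values` onto sequences that are
    non-decreasing (delta=1) or non-increasing (delta=-1)."""
    ys = [delta * v for v in values]      # flip sign to reuse the increasing case
    blocks = []                           # each block holds [sum, count]
    for y in ys:
        blocks.append([y, 1])
        # Pool adjacent blocks while their means violate the ordering.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            s, n = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += n
    fitted = []
    for s, n in blocks:
        fitted.extend([s / n] * n)        # every point in a block gets the block mean
    return [delta * f for f in fitted]
```

For example, a heating policy that is expected to decrease with the state of charge but contains a noisy bump, such as `[4, 2, 3, 1]`, is adjusted with `delta=-1` to `[4, 2.5, 2.5, 1]`.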

C. Day-ahead consumption plan 

This section explains how to construct a day-ahead schedule starting
from the Q-function obtained by Algorithm 1. Finding a day-ahead
schedule has a direct relation to two situations: 1) a day-ahead
market, where participants have to submit a day-ahead schedule one day
in advance of the actual consumption [?]; 2) a distributed optimization
process, where two or more participants are coupled by a common
constraint, e.g. congestion management [?]. Algorithm 2 describes a
model-free Monte Carlo estimator method [?] for policy evaluation that
makes use of a metric based on Q-values. The method estimates the
average return of a policy by synthesizing $p$ sequences of transitions
of length $T$ from $\mathcal{F}$. These $p$ sequences can be seen as a
proxy of the actual trajectories that could be obtained by simulating
the policy on the given control problem. Note that since we consider a
stochastic setting, $p$ needs to be greater than 1. A sequence is grown
in length by selecting a new transition among the samples of
not-yet-used one-step transitions in $\mathcal{F}$. Each new transition
is selected by minimizing a distance metric with the previously
selected transition.

In [?], Fonteneau et al. propose the following distance metric in
$X \times U$: $\Delta((x, u), (x', u')) = \|x - x'\| + \|u - u'\|$,
where $\| \cdot \|$ denotes the Euclidean norm. It is important to note
that this metric weighs each dimension of the state space equally. In
order to avoid specifying weights for each dimension, we propose a
distance metric as specified on line 8 of Algorithm 2. Here $Q^*$ is
obtained by applying Algorithm 1 and $x_k^i$ denotes the state
corresponding to the $i$th trajectory at time step $k$. The artificial
trajectory $P^i$ contains the control actions corresponding with the
optimal Q-value, given the state $x_k^i$ (see line 7). The next state
$x_{k+1}^i$ is found by taking the next state of the tuple that
minimizes the distance metric (see line 8). The regularization
parameter $\xi$ is a scalar that is included to penalize states that
have similar Q-values, but have a large Euclidean norm in the state
space. When the Q-function is strictly increasing or decreasing, $\xi$
can be set to 0. The motivation behind using Q-values instead of the
Euclidean distance in $X \times U$ is that Q-values capture the
dynamics of the system and, therefore, there is no need to select
individual weights.
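A compact sketch of the trajectory synthesis with the Q-value based metric is given below. It assumes states are tuples of floats, that `q` is a callable Q-function (for instance the output of a fitted Q-iteration routine), and that the batch holds at least `p * (horizon - 1)` transitions; the function name `mfmc_plans` and the data layout are illustrative assumptions.

```python
def mfmc_plans(batch, q, actions, x1, p, horizon, xi=0.0):
    """batch: list of (x, u, x_next, u_ph); returns p open-loop action sequences."""
    def dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5

    unused = list(range(len(batch)))      # indices of not-yet-used transitions
    plans = []
    for _ in range(p):
        x, plan = x1, []
        for _ in range(horizon - 1):      # corresponds to "while k < T"
            u = min(actions, key=lambda a: q(x, a))   # greedy action in x
            q_ref = q(x, u)
            # Q-value distance plus xi times the Euclidean state distance;
            # min() returns the lowest index among ties because `unused`
            # stays sorted in increasing order.
            best = min(
                unused,
                key=lambda l: abs(q_ref - q(batch[l][0], batch[l][1]))
                + xi * dist(x, batch[l][0]),
            )
            plan.append(u)
            x = batch[best][2]            # jump to the successor of the chosen tuple
            unused.remove(best)           # each transition is used at most once
        plans.append(plan)
    return plans
```

Averaging the `p` returned sequences per time step would give one candidate day-ahead plan; how the sequences are combined is left open in this sketch.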

V. Simulations 

This section presents the simulation results of three experiments and
evaluates the performance of the proposed algorithms. We focus on two
examples of flexible loads, i.e. an electric water heater and a
heat-pump thermostat. The first experiment evaluates the performance of
extended FQI (Algorithm 1) for a heat-pump thermostat. The rationale
behind using extended FQI for a heat-pump thermostat is that the
temperature dynamics of a building are influenced by exogenous weather
data, which is not the case for an electric water heater. In the second
experiment, we apply the policy adjustment method to an electric water
heater. The final experiment uses the model-free Monte Carlo method to
find a day-ahead consumption plan for a heat-pump thermostat. It should
be noted that the policy adjustment method and the model-free Monte
Carlo method can be applied to both the heat-pump thermostat and the
electric water heater.

A. Thermostatically controlled loads 

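Here we describe the state definition and the settings of the backup
controller of the electric water heater and the heat-pump thermostat.

Algorithm 2 Model-free Monte Carlo method [?].

Input: $\mathcal{F} = \{(x_l, u_l, x'_l, u_l^{ph})\}_{l=1}^{\#\mathcal{F}}$, $x_1$, $p$, $\xi$

1: $\mathcal{G} \leftarrow \mathcal{F}$
2: Apply Algorithm 1 to obtain $Q^*$
3: for $i = 1, \dots, p$ do
4:   $k \leftarrow 1$
5:   $x_k^i \leftarrow x_1$
6:   while $k < T$ do
7:     $u_k^i \leftarrow \arg\min_{u' \in U} Q^*(x_k^i, u')$
8:     $\mathcal{H} \leftarrow \arg\min_{(x_l, u_l, x'_l, u_l^{ph}) \in \mathcal{G}} |Q^*(x_k^i, u_k^i) - Q^*(x_l, u_l)| + \xi \|x_k^i - x_l\|$
9:     $l^i \leftarrow$ lowest index in $\mathcal{G}$ of the transitions in $\mathcal{H}$
10:    $P_k^i \leftarrow u_k^i$
11:    $k \leftarrow k + 1$
12:    $x_k^i \leftarrow x'_{l^i}$
13:    $\mathcal{G} \leftarrow \mathcal{G} \setminus \{(x_{l^i}, u_{l^i}, x'_{l^i}, u_{l^i}^{ph})\}$
14:  end while
15: end for

Output: $P^1, \dots, P^p$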

1) Electric water heater: We consider that the storage tank of the
electric water heater is equipped with $n_s$ temperature sensors. The
full state description of the electric water heater is defined as
follows:

$$x_k = (x_{t,k}^{q}, T_k^1, \dots, T_k^i, \dots, T_k^{n_s}), \tag{15}$$

where $x_{t,k}^{q}$ denotes the current quarter in the day and $T_k^i$
denotes the temperature measurement of the $i$th sensor. This work uses
feature extraction to reduce the dimensionality of the controllable
state space component by replacing it with the average sensor
measurement. As such the reduced state is defined as follows:

$$x_k = \left( x_{t,k}^{q}, \frac{\sum_{i=1}^{n_s} T_k^i}{n_s} \right). \tag{16}$$

More generic dimension reduction techniques, such as an auto-encoder
network and a principal component analysis [33], [34], will be explored
in future research. The logic of the backup controller of the electric
water heater is defined as:

$$B(x_k, u_k) = \begin{cases} u_k^{ph} = u_{max} & \text{if } x_{soc} \leq 30\% \\ u_k^{ph} = u_k & \text{if } 30\% < x_{soc} < 100\% \\ u_k^{ph} = 0 & \text{if } x_{soc} \geq 100\% \end{cases} \tag{17}$$

The electric heating element of the electric water heater can be
controlled with a binary control action $u_k \in \{0, u_{max}\}$, where
$u_{max} = 2.3$ kW is the maximum power. A detailed description of the
nonlinear dynamics and calculation of the state of charge $x_{soc}$ is
out of the scope of this paper and can be found in [26]. The stratified
thermal tank model consists of 50 layers of which 8 are measured to
construct the full state [?]. We use a set of hot-water demand profiles
with a mean load of 100 l/day obtained from [25] to simulate a
realistic tap demand.

2) Heat-pump thermostat: Our second application considers a heat-pump
thermostat that can measure the indoor air temperature, the outdoor air
temperature, and the solar radiation. The full state of the heat-pump
thermostat is defined as follows:

$$x_k = (x_{t,k}^{q}, T_{k,in}, T_{k,out}, S_k), \tag{18}$$

where the physical state space component consists of the indoor air
temperature $T_{k,in}$ and the exogenous state space component contains
the outside air temperature $T_{k,out}$ and the solar radiation $S_k$.
The internal heat gains, caused by user behavior and electric
appliances, cannot be measured and are not included in the state. We
included a measurement of the solar radiance in the state since solar
energy transmitted through windows can significantly impact the indoor
temperature dynamics. In order to have a practical implementation, we
consider that we cannot measure the temperature of the building
envelope. Similar to [14], we used a feature extraction technique based
on past state observations by including a virtual building envelope
temperature, which is a running average of the past $n_r$ air
temperatures:

$$x_k = \left( x_{t,k}^{q}, T_{k,in}, \frac{\sum_{i=k-n_r}^{k-1} T_{i,in}}{n_r}, T_{k,out}, S_k \right). \tag{19}$$

An alternative method could be to use a model-based technique to
estimate the temperature of the building envelope. The backup
controller of the heat-pump thermostat is defined as follows:

$$B(x_k, u_k) = \begin{cases} u_k^{ph} = u_{max} & \text{if } T_{k,in} \leq \underline{T}_k \\ u_k^{ph} = u_k & \text{if } \underline{T}_k < T_{k,in} < \overline{T}_k \\ u_k^{ph} = 0 & \text{if } T_{k,in} \geq \overline{T}_k \end{cases} \tag{20}$$

where $\underline{T}_k$ and $\overline{T}_k$ are the minimum and
maximum temperature settings defined by the end user. The action space
of the heat pump is discretized in 10 steps, $u_k \in \{0, \dots,
u_{max}\}$, where $u_{max} = 3$ kW is the maximum power. In the
simulation section we define a minimum and maximum comfort setting of
19°C and 23°C. A detailed description of the temperature dynamics of
the indoor air and building envelope can be found in [?]. The exogenous
information consists of the outside temperature, the solar radiation,
and the internal heat gains from a location in Belgium [?], [?].
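The feature extraction steps and the water heater backup logic described above can be sketched as follows. The function names, the inclusive treatment of the 30% and 100% thresholds, and the requirement that at least one past indoor temperature is available are assumptions made for this illustration.

```python
U_MAX_WATER_HEATER_KW = 2.3  # maximum power of the heating element stated in the text


def reduced_water_heater_state(quarter, sensor_temps):
    """Replace the individual sensor readings by their average."""
    return quarter, sum(sensor_temps) / len(sensor_temps)


def backup_water_heater(soc_percent, u_requested_kw):
    """Overrule the requested action outside the 30%..100% state-of-charge band."""
    if soc_percent <= 30.0:           # assumed inclusive lower threshold
        return U_MAX_WATER_HEATER_KW
    if soc_percent >= 100.0:          # assumed inclusive upper threshold
        return 0.0
    return u_requested_kw


def heat_pump_state(quarter, indoor_history, t_out, solar, n_r=3):
    """indoor_history: indoor temperatures up to and including time step k.
    Needs at least two entries so that the running average is defined."""
    t_in = indoor_history[-1]
    window = indoor_history[-(n_r + 1):-1]   # the n_r values before step k
    t_envelope = sum(window) / len(window)   # virtual envelope temperature
    return quarter, t_in, t_envelope, t_out, solar
```

The virtual envelope temperature deliberately excludes the current measurement, so the state carries both the present indoor temperature and a lagged summary of its recent history.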

In the following experiments we define a control period 
of one quarter and an optimization horizon T of 96 control 
periods. 


B. Experiment 1 

The goal of the first experiment is to compare the per¬ 
formance of Fitted Q-Iteration (standard FQI) 120) to the 
performance of our extension of FQI (extended FQI, given 
by Algorithm [T]l. The objective of the considered heating 
system is to minimize the electricity cost of the heat pump by 
responding to an external price signal. The electricity prices 
are taken from the Belgian wholesale market ll28l . We assume 
that a forecast of the outside temperature and solar radiance 
is available. Since the goal of the experiment is to assess the 
impact on the performance of FQI when a forecast is included. 



10 20 30 40 50 60 70 80 
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Fig. 2. Simulation results for a heat-pump thermostat and a dynamic pricing 
scheme using an optimal controller, Fitted Q-Iteration (FQI) with and without 
forecast, and a default controller. The top plot depicts the performance metric 
M and the bottom plot depicts the cumulative electricity cost. 


we assume perfect forecasts of the outside temperature and 
solar radiance. The observed state information of the controller 
is defined by ( [19] ) with set to 3, and the cost function is 
given by dlB. During the day, both FQI controllers use an e- 
greedy exploration strategy. This exploration strategy selects 
a random control action with probability Sk and follows the 
policy with probability l—Sk - The exploration probability Sk is 
decreased on a daily basis following a harmonic sequence ll^ . 
At the end of each day the FQI controller adds the tuples of 
the previous day to the batch and computes a policy for the 
next T time steps ll^ . 
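
The ε-greedy exploration described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and the schedule ε = 1/day is only one assumed form of a harmonic sequence, since the exact sequence used in the experiments is not given here.

```python
import random


def harmonic_epsilon(day: int) -> float:
    """Exploration probability for a 1-indexed day: 1, 1/2, 1/3, ...

    Assumption: the harmonic sequence is taken to be 1/day.
    """
    if day < 1:
        raise ValueError("day must be >= 1")
    return 1.0 / day


def epsilon_greedy(greedy_action, actions, epsilon, rng=random):
    """Return a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    return greedy_action
```

With this schedule the controller explores on every step of the first day and on roughly one step in ten by day 10, so most exploration happens while the batch is still small.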

In order to compare the performance of the FQI controllers 
we define the following metric: 

$$M = \frac{c_{fqi} - c_{dc}}{c_{oc} - c_{dc}} \qquad (21)$$

where $c_{fqi}$ denotes the daily cost of the FQI controller, $c_{dc}$ 
denotes the daily cost of the default controller and $c_{oc}$ denotes the 
daily cost of the optimal controller. The metric M corresponds 
to 0 if the FQI controller obtains the same performance as the 
default controller and corresponds to 1 if the FQI controller 
obtains the same performance as the optimal controller. The 
default controller is a hysteresis controller that switches on 
when the indoor air temperature is lower than 19°C and stops 
heating when the indoor air temperature reaches 20°C. The 
optimal controller is a model-based controller that has full 
information on the model parameters and has perfect forecasts 
of all exogenous information. 
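
As a rough sketch of the two definitions above, the metric M and the hysteresis rule of the default controller could be written as follows. The function names and the choice to keep the previous on/off state between the two thresholds are assumptions made for illustration.

```python
def performance_metric(c_fqi: float, c_dc: float, c_oc: float) -> float:
    """Metric M: 0 at default-controller cost, 1 at optimal-controller cost."""
    if c_oc == c_dc:
        raise ValueError("optimal and default cost must differ")
    return (c_fqi - c_dc) / (c_oc - c_dc)


def hysteresis_step(temp_c: float, heating_on: bool,
                    t_on: float = 19.0, t_off: float = 20.0) -> bool:
    """Default controller: switch on below t_on, stop heating at t_off.

    Between the thresholds the previous state is kept (assumed behaviour).
    """
    if temp_c < t_on:
        return True
    if temp_c >= t_off:
        return False
    return heating_on
```

For example, with hypothetical daily costs of 10 for the default controller and 6 for the optimal controller, an FQI controller with a daily cost of 8 sits halfway between the two and obtains M = 0.5.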

The simulation results of the heating system for a simulation 
horizon of 80 days are depicted in Fig. 2. The top plot depicts 
the daily metric M and the bottom plot depicts the cumulative 
electricity cost of the heat-pump thermostat. The average 
metric M over the simulation horizon is 0.56 for standard FQI 
and 0.71 for extended FQI, which is an improvement of 27%. 
The performance gap of 0.29 between extended FQI and the 
optimal controller is a reasonable result given that the model 
dynamics and disturbances are unknown, and that exploration 
is included. 
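
The reported figures can be checked with a short calculation, assuming the 27% is a relative improvement of the average metric and the gap is measured against M = 1 for the optimal controller:

```latex
\frac{0.71 - 0.56}{0.56} \approx 0.27,
\qquad
1 - 0.71 = 0.29 .
```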

Extended FQI decreased the total electricity cost by 19% 
compared to the default controller over the total simulation 
horizon, whereas standard FQI decreased the total electricity 
cost by 14%. It is important to note that the reduction of 19% 
is not a result of lower energy consumption, since the energy 
consumption increased by 4% compared to the default 
controller when extended FQI was used. 

With this experiment we showed that we successfully 
extended fitted Q-iteration to incorporate a forecast of the 
exogenous data. 

Fig. 3. Simulation results for an electric water heater and dynamic pricing 
scheme. The top row depicts the original policies and the middle row depicts 
the repaired policies for day 7, 14, and 21. The price profile corresponding to 
each policy is depicted in the background. The bottom plot illustrates the 
impact of adding expert knowledge on the objective. 

Fig. 4. The top plot depicts the day-ahead consumption plan, actual 
consumption, and day-ahead price. The middle plot depicts the actual and 
optimal indoor temperature. The left bottom plot depicts the metric M. 
The daily deviation between the day-ahead consumption plan and actual 
consumption is given in the right bottom plot. 

C. Experiment 2 

The following experiment demonstrates the policy adjustment 
method for an electric water heater. As an illustrative 
example we used a sinusoidal price profile. As stated in ( fTSl l. 
the state space of an electric water heater consists of two 
dimensions, i.e. the time component and the average temperature. 
Here we enforce a monotonicity constraint along the 
second dimension, which contains the average temperature. 
The original policies after 7, 14 and 21 days obtained with 
fitted Q-iteration can be seen in the top row of plots of Fig. 3. 
The original policies violate the monotonicity constraint along 
the second dimension in several states. The adjusted policies 
obtained by the policy adjustment method are depicted in the 
middle row of plots. The simulation results indicate that when 
the number of tuples in the batch increases, the original and 
adjusted policies converge. The bottom plot depicts the 
cumulative cost of fitted Q-iteration with and without expert 
knowledge over the simulation horizon. The policy adjustment 
method was able to reduce the total objective by 11% over 
60 days. The results indicate that when the number of tuples 
in the batch is small, the expert policy adjustment method can 
be used to improve the performance of standard fitted Q-iteration. 
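
To make the notion of a monotonicity violation concrete, the sketch below scans a policy stored on a grid and lists the states where the constraint is broken. It only detects violations and is not the policy adjustment method itself. The grid layout (rows for time steps, columns for increasing average temperature) and the direction of the constraint (the control action should not increase with temperature) are assumptions made for illustration.

```python
def monotonicity_violations(policy):
    """List (time_index, temp_index) pairs that break the constraint.

    policy[t][j] is the action at time step t and temperature bin j,
    with temperature increasing in j. Assumption: the action must be
    non-increasing in temperature.
    """
    violations = []
    for t, row in enumerate(policy):
        for j in range(len(row) - 1):
            if row[j + 1] > row[j]:
                violations.append((t, j))
    return violations


# Hypothetical on/off policy on a 2 x 4 grid.
example_policy = [
    [1, 1, 0, 0],  # monotone: heat at low temperature only
    [1, 0, 1, 0],  # heats at a higher temperature after switching off
]
print(monotonicity_violations(example_policy))
```

A policy learned from few tuples would be expected to produce more such entries than one learned from a large batch, in line with the convergence of the original and adjusted policies reported above.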

D. Experiment 3 

The final experiment demonstrates the Model-Free Monte 
Carlo (MFMC) method (Algorithm 2) for finding the day-ahead 
consumption plan of a heat-pump thermostat. The state 
variable is defined by ( fT^ . and the cost function is defined 
by (fT2l i. The parameter a was set to 10^ to penalize possible 
deviations between the planned consumption profile and the 
actual consumption. The results of the experiment are depicted 
in Fig. 4. The parameter p, which indicates the number of 
artificial trajectories, was set to 4. A day-ahead consumption 
plan was obtained by taking the average of these 4 trajectories. 
In order to assess the performance of the MFMC method, we 
defined an optimal controller that can exploit the model and 
has prescient knowledge of the internal heat gains. The top 
plot depicts the day-ahead consumption plan and actual 
consumption profile of a mature controller whose batch 
contains 60 days of data. The corresponding indoor air 
temperature of the presented controller and optimal controller 
is depicted in the middle plot. In order to compare the daily 
cost of the MFMC method to the optimal controller, we define 
$M = c_{mfmc}/c_{oc}$, where $c_{mfmc}$ is the daily cost of the 
MFMC method and $c_{oc}$ is the daily cost of the optimal 
controller. The metric M for each day and the deviation 
between the day-ahead consumption plan and the actual 
consumption are depicted in the bottom plots. As the right 
bottom plot indicates, the daily deviations decrease as the 
number of days in the batch increases. The mean metric M 
over the whole simulation period, including the exploration 
phase, is 0.81. The results of the last experiment imply that 
the model-free Monte Carlo method can be successfully used 
to construct a forward consumption plan for the next day. 
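
The construction of a consumption plan from artificial trajectories can be sketched as follows. This is a simplified illustration and not a reproduction of Algorithm 2: it assumes a scalar state and action, an L1 metric on the state-action pair, tuples of the form (state, action, next state), and that each tuple is used at most once. The name `mfmc_plan` and the toy data are hypothetical.

```python
def mfmc_plan(batch, policy, x0, horizon, p):
    """Average the actions of p artificial trajectories into a plan.

    batch: list of (x, u, x_next) tuples with scalar state and action.
    policy: function (t, x) -> action.
    """
    if len(batch) < p * horizon:
        raise ValueError("batch needs at least p * horizon tuples")
    unused = list(batch)
    trajectories = []
    for _ in range(p):
        x, actions = x0, []
        for t in range(horizon):
            u = policy(t, x)
            # Pick the closest unused tuple under an (assumed) L1 metric.
            idx = min(
                range(len(unused)),
                key=lambda i: abs(unused[i][0] - x) + abs(unused[i][1] - u),
            )
            _, u_l, x_next = unused.pop(idx)
            actions.append(u_l)
            x = x_next
        trajectories.append(actions)
    # The plan is the per-step average over the p trajectories.
    return [sum(tr[t] for tr in trajectories) / p for t in range(horizon)]


# Toy data: heat (u = 1) below 20 degrees, otherwise stay off.
toy_batch = [
    (19.0, 1.0, 20.5), (19.2, 1.0, 20.4),
    (20.5, 0.0, 19.8), (20.4, 0.0, 19.9),
]
plan = mfmc_plan(toy_batch, lambda t, x: 1.0 if x < 20.0 else 0.0,
                 x0=19.0, horizon=2, p=2)
print(plan)
```

Because no model of the building is fitted, the quality of the plan in such a scheme depends on how well the batch covers the states the policy visits, which is consistent with the deviations shrinking as the batch grows.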

VI. Conclusion 

Driven by the challenges presented by the system identification 
step of model-based controllers, this paper contributed to 
the application of model-free batch Reinforcement Learning 
(RL) techniques to a demand response setting. Motivated by 
the fact that some demand response applications, e.g. a 
heat-pump thermostat, are influenced by exogenous weather data, 
we adapted a standard batch RL technique, fitted Q-iteration, 
to incorporate a forecast of the exogenous data. Numerical 
results indicate that the proposed extension of fitted 
Q-iteration improved the performance of standard fitted 
Q-iteration by 27%. In general, 
batch RL techniques do not require any prior knowledge on the 
system behavior or the solution. However, for some demand 
response applications, expert knowledge about the monotonicity 
of the solution, i.e. the policy, can be available. As such, we 
presented an expert policy adjustment method that can exploit 
this expert knowledge. The results of an experiment with an 
electric water heater indicate that the policy adjustment method 
was able to reduce the cost objective by 11% compared to 
fitted Q-iteration without expert knowledge. A final challenge 
for model-free batch RL techniques is that of finding a 
day-ahead consumption plan, i.e. an open-loop solution. In order 
to solve this problem we presented a model-free Monte Carlo 
method and we successfully tested the method for finding the 
day-ahead consumption plan of a heat pump. 

Our future research in this area will focus on employing the 
presented algorithms in a realistic lab environment. 